Load libraries

library(dplyr)
library(ggplot2)
library(shiny)
library(shinyWidgets)
library(gridExtra)
library(tidyverse)
library(readxl)
library(rpart)
library(rpart.plot)
library(ROSE)
library(knitr)

Part 1 - Classification

For the purpose of classification, we want to classify each YouTuber's highest_yearly_earnings. We will obtain the GDP of each country and create the target variable status by comparing each YouTuber's earnings against the GDP of their country.

Section 1 - Data Preparation

df <- read.csv('youtube_UTF_8.csv', na.strings = c('nan'))
df <- df[, -c(6, 9, 11:13, 15:17, 21:28)] # Remove redundant columns

highest_yearly_earnings is highly correlated with lowest_monthly_earnings, highest_monthly_earnings, and lowest_yearly_earnings, so we remove those three from this analysis. Redundant columns such as Title and Abbreviation are also excluded. Lastly, we remove created_month, created_date, Gross.tertiary.education.enrollment...., Population, Unemployment.rate, Urban_population, Latitude, and Longitude; the last six are directly determined by Country, so we keep only Country.
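As a quick sanity check, the claimed correlations can be inspected before dropping the columns. The sketch below re-reads the raw file into a separate object (column names assumed from the raw csv, as in the prose above):

```r
# Sketch: verify the high pairwise correlations among the earnings columns
# before removing them; 'raw' avoids overwriting the cleaned df
raw <- read.csv('youtube_UTF_8.csv', na.strings = c('nan'))
earnings_cols <- c("lowest_monthly_earnings", "highest_monthly_earnings",
                   "lowest_yearly_earnings", "highest_yearly_earnings")
round(cor(raw[, earnings_cols], use = "complete.obs"), 2)
```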

# Function to count NAs in each column
count_na_in_column <- function(data) {
  sapply(data, FUN = function(col) sum(is.na(col)))
}
count_na_in_column(df)
##                             rank                         Youtuber 
##                                0                                0 
##                      subscribers                      video.views 
##                                0                                0 
##                         category                          uploads 
##                               46                                0 
##                          Country                     channel_type 
##                              122                               30 
## video_views_for_the_last_30_days          highest_yearly_earnings 
##                               56                                0 
##     subscribers_for_last_30_days                     created_year 
##                              337                                5

There are 122 missing entries in the Country variable, and we need to impute them because our GDP lookup depends on Country. Our chosen imputation method is to fill the missing values with the mode of the Country variable.

1. Impute missing values in Country

# Extract the name of the country
country_name <- df$Country %>%
  table() %>%
  names()

# Extract the name of the country with the maximum frequency
country_mode <- country_name[which.max(table(df$Country))]

# Impute missing country with mode
df <- df %>%
  mutate(Country = ifelse(is.na(Country), country_mode, Country)) 

2. Impute missing values in subscribers_for_last_30_days, video_views_for_the_last_30_days

replace_na_with_median <- function(data, columns) {
  for (col in columns) {
    x <- data[[col]]
    non_na_values <- x[!is.na(x)]
    median_value <- median(non_na_values)
    is_bad <- ifelse(is.na(x), 1, 0) # isBAD takes value 1 where an NA is replaced by the median
    data[[col]][is.na(data[[col]])] <- median_value
    data[[paste0(col, "_isBAD")]] <- is_bad
  }
  return(data)
}

columns_to_replace <- c("subscribers_for_last_30_days", "video_views_for_the_last_30_days")
df <- replace_na_with_median(df, columns_to_replace)

3. Impute missing values in category and channel_type

replace_missing_with_mapping <- function(data, level_mapping) {
  for (i in seq_len(nrow(level_mapping))) {
    category_level <- level_mapping$category_level[i]
    channel_level <- level_mapping$channel_level[i]
    
    data$channel_type[is.na(data$channel_type) & data$category == category_level] <- channel_level
    data$category[is.na(data$category) & data$channel_type == channel_level] <- category_level
  }
  return(data)
}

# Create mapping of similar level of category and channel_type
level_mapping <- data.frame(
  category_level = c("Pets & Animals", "Autos & Vehicles", "Comedy", "Education", "Entertainment", "Film & Animation", "Gaming", "Howto & Style", "Music", "News & Politics", "Nonprofits & Activism", "People & Blogs", "Sports", "Science & Technology"),
  channel_level = c("Animals", "Autos", "Comedy", "Education", "Entertainment", "Film", "Games", "Howto", "Music", "News", "Nonprofit", "People", "Sports", "Tech"))

df <- replace_missing_with_mapping(df, level_mapping)
count_na_in_column(df)
##                                   rank                               Youtuber 
##                                      0                                      0 
##                            subscribers                            video.views 
##                                      0                                      0 
##                               category                                uploads 
##                                      3                                      0 
##                                Country                           channel_type 
##                                      0                                      4 
##       video_views_for_the_last_30_days                highest_yearly_earnings 
##                                      0                                      0 
##           subscribers_for_last_30_days                           created_year 
##                                      0                                      5 
##     subscribers_for_last_30_days_isBAD video_views_for_the_last_30_days_isBAD 
##                                      0                                      0
df <- na.omit(df)  # Omit the remaining rows with missing values

4. GDP information for each Country

gdp <- read_excel("GDP.xls", na = "no data")
gdp <- as.data.frame(gdp) # Convert tibble to dataframe
gdp <- gdp[-1, ] # Remove an empty first row
new_column_names <- c("Country", as.character(1980:2028))
gdp <- gdp %>%
  rename_with(~new_column_names, .cols = everything()) # Rename columns
# Verify the consistency of country names between the YouTube data and GDP data
sum(country_name %in% gdp$Country) - length(country_name)
## [1] -5
country_name[which(!country_name %in% gdp$Country)]
## [1] "China"       "Cuba"        "Russia"      "South Korea" "Turkey"

We found 5 countries in the YouTube dataset that are missing from the GDP dataset. After inspecting the GDP data, four of them turn out to appear under different official names, so we rename those. However, this dataset has no GDP for Cuba, so we need another data source for it.

gdp <- gdp %>%
  mutate(Country = gsub("China, People's Republic of", "China", Country)) %>%
  mutate(Country = gsub("Korea, Republic of", "South Korea", Country)) %>%
  mutate(Country = gsub("Russian Federation", "Russia", Country)) %>%
  mutate(Country = gsub("Türkiye, Republic of", "Turkey", Country))
gdp[gdp$Country %in% levels(as.factor(df$Country)), c("Country", "2020", "2021", "2022", "2023")]
##                  Country      2020      2021      2022      2023
## 2            Afghanistan   611.268        NA        NA        NA
## 5                Andorra 36973.845 41873.060 41931.030 44386.988
## 8              Argentina  8571.937 10616.947 13655.200 13709.490
## 11             Australia 53071.718 63896.297 65526.118 64964.282
## 16            Bangladesh  2270.348  2497.741  2730.846  2469.577
## 17              Barbados 16379.468 16864.318 19579.156 21085.944
## 26                Brazil  6970.719  7754.611  8995.029  9673.093
## 34                Canada 43383.713 52387.812 55085.452 52722.484
## 37                 Chile 13067.742 16059.804 15094.829 17827.414
## 38                 China 10525.001 12572.071 12813.765 13721.051
## 39              Colombia  5363.072  6239.274  6664.269  6417.047
## 52               Ecuador  5670.330  5978.915  6462.217  6642.700
## 53                 Egypt  3802.438  4145.939  4563.298  3644.255
## 54           El Salvador  3903.414  4551.161  4987.903  5308.092
## 61               Finland 49168.162 53595.186 50655.133 54351.246
## 62                France 40385.395 45185.845 42409.045 44408.417
## 66               Germany 46735.314 51237.643 48636.030 51383.504
## 79                 India  1913.220  2234.337  2379.209  2601.362
## 80             Indonesia  3932.332  4362.677  4798.119  5016.638
## 82                  Iraq  4219.874  5015.489  6399.688  6180.517
## 85                 Italy 31784.787 35842.069 34113.201 36812.317
## 87                 Japan 40117.916 39882.554 33821.931 35385.073
## 88                Jordan  4336.382  4460.880  4740.941  5048.459
## 92           South Korea 31728.304 34997.985 32250.409 33393.074
## 94                Kuwait 22683.638 28884.300 38328.894 33646.139
## 97                Latvia 18123.655 20996.889 22347.970 25136.162
## 107             Malaysia 10361.276 11449.782 12364.058 13382.408
## 114               Mexico  8533.496  9869.076 10867.807 12673.634
## 119              Morocco  3375.265  3934.310  3764.720  3748.598
## 125          Netherlands 52222.364 57996.912 56489.068 61098.871
## 133             Pakistan  1376.512  1564.434  1658.363        NA
## 138                 Peru  6145.031  6678.850  7094.448  7772.959
## 139          Philippines  3325.836  3576.101  3623.389  3905.483
## 145               Russia 10180.670 12617.864 15443.829 14403.573
## 150                Samoa  4371.767  4224.921  4140.002  4436.908
## 152         Saudi Arabia 20971.147 25463.654 31849.729 29922.089
## 157            Singapore 61274.006 77710.070 82807.649 91100.374
## 164                Spain 26943.767 30133.937 29420.615 31223.354
## 168               Sweden 52706.294 60929.619 55689.403 55395.444
## 169          Switzerland 85872.732 92238.988 92371.449 98767.334
## 175             Thailand  7170.849  7226.837  7650.884  8181.928
## 183               Turkey  8612.310  9654.085 10618.285 11931.502
## 185              Ukraine  3780.075  4882.111  4348.570  4653.892
## 186 United Arab Emirates 37648.963 43422.085 51305.686 49451.635
## 187       United Kingdom 40347.364 46421.611 45294.805 46371.452
## 188        United States 63577.341 70159.774 76348.494 80034.581
## 192            Venezuela  1566.623  2071.603  3459.154  3640.812
## 193              Vietnam  3548.892  3753.428  4086.519  4475.507

All of the countries above have the latest (2023) GDP data except Pakistan and Afghanistan. For those two we use the most recent available figures: 2022 for Pakistan and 2020 for Afghanistan.

gdp2023 <- gdp[, c(1, 45)]
gdp2022 <- subset(gdp, Country == "Pakistan", select = c("Country", "2022"))
gdp2020 <- subset(gdp, Country == "Afghanistan", select = c("Country", "2020"))

We join the GDP data to the YouTube data using Country as the key. We use left_join() because GDP data for Cuba is still missing at this point.

df_gdp <- left_join(df, gdp2023, by = "Country")
df_gdp <- df_gdp %>%
  rename(GDP = `2023`) # Rename GDP column
df_gdp$GDP <- ifelse(df_gdp$Country == "Afghanistan", gdp2020[1, 2],
                    ifelse(df_gdp$Country == "Pakistan", gdp2022[1, 2], df_gdp$GDP))
worldbank <- read_excel("gdp_worldbank.xls")
worldbank <- worldbank[-c(1:2), ] %>%
  setNames(.[1, ]) # Use the first row as column names
worldbank <- worldbank[-1, ] %>%
  as.data.frame()

We used another data source for the GDP of Cuba, called “gdp_worldbank.xls”. The latest available GDP for Cuba is from 2020. We have now completed the process of finding GDP for every country in the YouTube dataset.

subset(worldbank, `Country Name` == "Cuba", select = c(`2020`, `2021`, `2022`))
##                  2020 2021 2022
## 51 9499.5902023043182 <NA> <NA>
cuba_gdp <- subset(worldbank, `Country Name` == "Cuba", select = c(`Country Name`, `2020`))
df_gdp$GDP <- ifelse(df_gdp$Country == "Cuba", cuba_gdp[1, 2], df_gdp$GDP)
sum(is.na(df_gdp$GDP)) # All GDP is inserted
## [1] 0
rm(gdp2020, gdp2022, gdp2023, cuba_gdp, gdp, worldbank) # Remove unused dataframe

5. Convert dataframe to the appropriate structure

df_gdp <- df_gdp %>%
  mutate(GDP = as.numeric(GDP)) %>%
  mutate(created_year = as.factor(created_year)) %>%
  mutate(subscribers_for_last_30_days_isBAD = as.factor(subscribers_for_last_30_days_isBAD)) %>%
  mutate(video_views_for_the_last_30_days_isBAD = as.factor(video_views_for_the_last_30_days_isBAD))

6. Create target variable status

df_gdp$status <- ifelse(df_gdp$GDP < df_gdp$highest_yearly_earnings, 1, 0)
table(df_gdp$status)
## 
##   0   1 
## 169 817

The target variable status is created with 1 indicating a high-earning YouTuber and 0 a low-earning one. There are 817 high earners against 169 low earners. The classes are imbalanced; therefore, we will use the ROSE (Random Over-Sampling Examples) package to generate synthetic data and balance the status variable.

7. Balancing the target variable status

df_gdp.rose <- df_gdp[, -c(1:2, 8, 10, 15, 17)]

# Map the Country column to numeric codes
country_mapping <- c(
  "Afghanistan" = 1, "Andorra" = 2, "Argentina" = 3, "Australia" = 4,
  "Bangladesh" = 5, "Barbados" = 6, "Brazil" = 7, "Canada" = 8,
  "Chile" = 9, "China" = 10, "Colombia" = 11, "Cuba" = 12,
  "Ecuador" = 13, "Egypt" = 14, "El Salvador" = 15, "Finland" = 16,
  "France" = 17, "Germany" = 18, "India" = 19, "Indonesia" = 20,
  "Iraq" = 21, "Italy" = 22, "Japan" = 23, "Jordan" = 24,
  "Kuwait" = 25, "Latvia" = 26, "Malaysia" = 27, "Mexico" = 28,
  "Morocco" = 29, "Netherlands" = 30, "Pakistan" = 31, "Peru" = 32,
  "Philippines" = 33, "Russia" = 34, "Samoa" = 35, "Saudi Arabia" = 36,
  "Singapore" = 37, "South Korea" = 38, "Spain" = 39, "Sweden" = 40,
  "Switzerland" = 41, "Thailand" = 42, "Turkey" = 43, "Ukraine" = 44,
  "United Arab Emirates" = 45, "United Kingdom" = 46, "United States" = 47,
  "Venezuela" = 48, "Vietnam" = 49)
df_gdp.rose$Country <- country_mapping[df_gdp.rose$Country]
df_gdp.rose$Country <- as.factor(df_gdp.rose$Country)

# Map the category column to numeric codes
category_mapping <- c(
  "Autos & Vehicles" = 1, "Comedy" = 2, "Education" = 3, "Entertainment" = 4,
  "Film & Animation" = 5, "Gaming" = 6, "Howto & Style" = 7, "Movies" = 8,
  "Music" = 9, "News & Politics" = 10, "Nonprofits & Activism" = 11, "People & Blogs" = 12,
  "Pets & Animals" = 13, "Science & Technology" = 14, "Shows" = 15, "Sports" = 16,
  "Trailers" = 17, "Travel & Events" = 18)
df_gdp.rose$category <- category_mapping[df_gdp.rose$category]
df_gdp.rose$category <- as.factor(df_gdp.rose$category)

df.rose <- ROSE(status ~ ., data = df_gdp.rose, seed=123)$data
table(df.rose$status)
## 
##   0   1 
## 488 498
df.rose$status <- as.factor(df.rose$status)

ROSE handles only continuous and categorical variables, so we recoded Country and category as numeric levels (stored as factors). After applying ROSE, we obtained a balanced dataset.

8. Split Train/Test dataset

# a 90/10 split to form the training and test sets
set.seed(729375)
df.rose$rgroup <- runif(dim(df.rose)[1])
dTrainAll <- subset(df.rose, rgroup<=0.9)
dTest <- subset(df.rose, rgroup>0.9)
outcome <- c('status')
pos <- '1'

# names of columns that are categorical type and numerical type 
vars <- colnames(df.rose)[!(colnames(df.rose) %in% c("status", "rgroup"))]
catVars <- vars[sapply(dTrainAll[, vars], class) %in% c('factor', 'character')]
numericVars <- vars[sapply(dTrainAll[, vars], class) %in% c('numeric', 'integer')] 

# split dTrainAll into a training set and a validation (or calibration) set
useForCal <- rbinom(n=dim(dTrainAll)[1], size=1, prob=0.1)>0 
dCal <- subset(dTrainAll, useForCal)
dTrain <- subset(dTrainAll, !useForCal)
cat("Country list in dTrain:")
## Country list in dTrain:
table(dTrain$Country)
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   4   2  11  11   3   2  39  12   9   1   4   6   2   2   0   0   2   5 116  16 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   1   0   5   0   0   0   0  30   1   2   5   0   6  12   2   3   6   9  22   0 
##  41  42  43  44  45  46  47  48  49 
##   0  13   3   5   3  37 403   1   0
cat("Country list in dTest:")
## Country list in dTest:
table(dTest$Country)
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  0  1  1  2  0  0  3  0  0  0  0  1  0  0  0  0  1  0 10  3  0  0  1  0  0  0 
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
##  1  2  0  0  2  0  1  0  0  0  1  1  6  0  0  2  0  0  0  3 55  0  0
dTest <- dTest %>% filter(Country != 27)

Country level 27 (Malaysia) appears in the test data but not in the training data. This mismatch could cause issues when predicting on the test set, so we exclude that observation from the test data.
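More generally, levels present in the test set but absent from the training set can be detected programmatically rather than by eye. A minimal sketch (an alternative to the manual filter above):

```r
# Sketch: find Country levels that occur in dTest but never in dTrain,
# then drop those test rows
unseen <- setdiff(unique(dTest$Country), unique(dTrain$Country))
dTest <- dTest[!(dTest$Country %in% unseen), ]
```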

Section 2 - Single Variable Model

# Function to calculate predictions for categorical variables
mkPredC <- function(outCol, varCol, appCol) {
  pPos <- sum(outCol == pos) / length(outCol)
  naTab <- table(as.factor(outCol[is.na(varCol)]))
  pPosWna <- (naTab/sum(naTab))[pos]
  vTab <- table(as.factor(outCol), varCol)
  pPosWv <- (vTab[pos, ] + 1.0e-3*pPos) / (colSums(vTab) + 1.0e-3)
  pred <- pPosWv[appCol]
  pred[is.na(appCol)] <- pPosWna
  pred[is.na(pred)] <- pPos
  pred
}
for(v in catVars) {
  pi <- paste('pred', v, sep='')
  dTrain[,pi] <- mkPredC(dTrain[,outcome], dTrain[,v], dTrain[,v])
  dCal[,pi] <- mkPredC(dTrain[,outcome], dTrain[,v], dCal[,v])
  dTest[,pi] <- mkPredC(dTrain[,outcome], dTrain[,v], dTest[,v])
}
# Function to evaluate prediction through AUC score
library('ROCR')
calcAUC <- function(predcol,outcol) {
  perf <- performance(prediction(predcol,outcol==pos),'auc')
  as.numeric(perf@y.values)
}
for(v in catVars) {
  pi <- paste('pred', v, sep='')
  aucTrain <- calcAUC(dTrain[,pi], dTrain[,outcome])
  if (aucTrain >= 0.5) {
    aucCal <- calcAUC(dCal[,pi], dCal[,outcome])
    print(sprintf("%s: trainAUC: %4.3f; calibrationAUC: %4.3f", pi, aucTrain, aucCal))
    }
}
## [1] "predcategory: trainAUC: 0.589; calibrationAUC: 0.626"
## [1] "predCountry: trainAUC: 0.734; calibrationAUC: 0.676"
## [1] "predcreated_year: trainAUC: 0.695; calibrationAUC: 0.700"
## [1] "predsubscribers_for_last_30_days_isBAD: trainAUC: 0.639; calibrationAUC: 0.709"
## [1] "predvideo_views_for_the_last_30_days_isBAD: trainAUC: 0.632; calibrationAUC: 0.640"

The AUC results on the training and calibration datasets represent a model's ability to distinguish between the positive and negative classes. An AUC of 0.5 indicates no discriminatory power, equivalent to random guessing. All the categorical predictors have AUCs greater than 0.5, so each has some level of discriminatory power. The predictor with the best discriminatory power on the training data is predCountry with an AUC of 0.734, while predsubscribers_for_last_30_days_isBAD performs best on unseen data with an AUC of 0.709.
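To ground this interpretation, here is a toy illustration using calcAUC() defined above: a score that perfectly separates the classes yields an AUC of 1, while a constant (uninformative) score is equivalent to random guessing.

```r
# Toy example: perfect separation vs. a constant score
toy_outcome <- c("1", "1", "1", "0", "0", "0")
calcAUC(c(0.9, 0.8, 0.7, 0.3, 0.2, 0.1), toy_outcome)  # 1
calcAUC(rep(0.5, 6), toy_outcome)                       # 0.5
```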

# 100 fold cross validation
for (v in catVars) {
  aucs <- rep(0,100)
  for (rep in 1:length(aucs)) {
    useForCalRep <- rbinom(n=nrow(dTrainAll), size=1, prob=0.1) > 0
    predRep <- mkPredC(dTrainAll[!useForCalRep, outcome],
                       dTrainAll[!useForCalRep, v],
                       dTrainAll[useForCalRep, v])
    aucs[rep] <- calcAUC(predRep, dTrainAll[useForCalRep, outcome])
    }
  print(sprintf("%s: mean: %4.3f; sd: %4.3f", v, mean(aucs), sd(aucs)))
  }
## [1] "category: mean: 0.558; sd: 0.053"
## [1] "Country: mean: 0.700; sd: 0.051"
## [1] "created_year: mean: 0.673; sd: 0.054"
## [1] "subscribers_for_last_30_days_isBAD: mean: 0.646; sd: 0.048"
## [1] "video_views_for_the_last_30_days_isBAD: mean: 0.637; sd: 0.035"

To obtain a more robust performance metric, we apply 100-fold cross-validation, splitting the dataset multiple times so we can assess how the model performs on different portions of the data and better estimate its performance on unseen data. From the results above, the predictor Country shows the strongest discriminatory power with a mean AUC of 0.700. Overall, all predictors performed consistently, as evidenced by the relatively small standard deviations.

# Double density plot
str(factor(dTrain[,"category"]))
##  Factor w/ 17 levels "1","2","3","4",..: 2 9 12 6 4 9 5 4 6 4 ...
str(factor(dTrain[,"Country"]))
##  Factor w/ 38 levels "1","2","3","4",..: 17 7 37 36 17 17 37 17 17 37 ...
str(factor(dTrain[,"created_year"]))
##  Factor w/ 18 levels "2005","2006",..: 16 10 13 11 16 2 4 10 13 14 ...
fig1 <- ggplot(dCal) + geom_density(aes(x=predcategory, color=as.factor(status)))
fig2 <- ggplot(dCal) + geom_density(aes(x=predCountry, color=as.factor(status)))
fig3 <- ggplot(dCal) + geom_density(aes(x=predcreated_year, color=as.factor(status)))
grid.arrange(fig1, fig2, fig3, ncol=2)

To visualize the distribution of predicted values between the two classes, we created a density plot for each variable. The sharp peak in the blue curve of the predcreated_year vs. status plot signifies a dominant presence of high earners relative to low earners, especially around a predicted value of 0.5. In contrast, the predcategory vs. status and predCountry vs. status plots show substantial numbers of both high and low earners where the value is around 0.4-0.5, as evidenced by the sharp peaks at these points. However, these two predictors have significant overlap between the classes, which makes it harder to differentiate positive from negative.

# ROC Curve
library(ROCit)
plot_roc <- function(predcol, outcol, colour_id=2, overlaid=F) {
  ROCit_obj <- rocit(score=predcol, class=outcol==pos)
  par(new=overlaid)
  plot(ROCit_obj, col = c(colour_id, 1),
  legend = FALSE, YIndex = FALSE, values = FALSE)
}
plot_roc(dTest$predcategory, dTest[,outcome]) # red
plot_roc(dTest$predCountry, dTest[,outcome], colour_id=3, overlaid=T) # green
plot_roc(dTest$predcreated_year, dTest[,outcome], colour_id=5, overlaid=T) # sky blue

The ROC plot above provides a visual analysis to examine the trade-off between the true positive rate and the false positive rate of the classifier. The predcreated_year model has little to no ability to discriminate between the positive and negative classes as it closely follows the dashed line, which represents no discriminatory power. On the other hand, the model being represented by the green curve predCountry has a good discriminative ability, particularly at thresholds where the curve is steepest. It is considerably better than random guessing due to its substantial departure from the diagonal line, which means that it has a lower false positive rate than a random classifier.

# Function to calculate predictions for numerical variables
mkPredN <- function(outCol, varCol, appCol) {
  # compute the cuts
  cuts <- unique(quantile(varCol, probs=seq(0, 1, 0.1), na.rm=T))
  # discretize the numerical columns
  varC <- cut(varCol, cuts)
  appC <- cut(appCol, cuts)
  mkPredC(outCol, varC, appC)
}
for (v in numericVars) {
  pi <- paste('pred', v, sep='')
  dTrain[,pi] <- mkPredN(dTrain[,outcome], dTrain[,v], dTrain[,v])
  dCal[,pi] <- mkPredN(dTrain[,outcome], dTrain[,v], dCal[,v])
  dTest[,pi] <- mkPredN(dTrain[,outcome], dTrain[,v], dTest[,v])
  aucTrain <- calcAUC(dTrain[,pi], dTrain[,outcome])
  if(aucTrain >= 0.55) {
    aucCal <- calcAUC(dCal[,pi], dCal[,outcome])
    print(sprintf("%s: trainAUC: %4.3f; calibrationAUC: %4.3f", pi, aucTrain, aucCal))
  }
  }
## [1] "predsubscribers: trainAUC: 0.584; calibrationAUC: 0.571"
## [1] "predvideo.views: trainAUC: 0.591; calibrationAUC: 0.572"
## [1] "preduploads: trainAUC: 0.843; calibrationAUC: 0.815"
## [1] "predvideo_views_for_the_last_30_days: trainAUC: 0.728; calibrationAUC: 0.718"
## [1] "predsubscribers_for_last_30_days: trainAUC: 0.777; calibrationAUC: 0.789"

All the numerical predictors have AUCs greater than 0.5, meaning all of them have some level of discriminatory power. The predictor with the strongest discriminatory power on the training data is preduploads with an AUC of 0.843. Notably, its performance on unseen data also stands out, with an AUC of 0.815. Overall, this predictor consistently demonstrates robust discriminatory capability.

for (v in numericVars) {
  aucs <- rep(0,100)
  for (rep in 1:length(aucs)) {
    useForCalRep <- rbinom(n=nrow(dTrainAll), size=1, prob=0.1) > 0
    predRep <- mkPredN(dTrainAll[!useForCalRep, outcome],
                       dTrainAll[!useForCalRep, v],
                       dTrainAll[useForCalRep, v])
    aucs[rep] <- calcAUC(predRep, dTrainAll[useForCalRep, outcome])
  }
  print(sprintf("%s: mean: %4.3f; sd: %4.3f", v, mean(aucs), sd(aucs)))
}
## [1] "subscribers: mean: 0.555; sd: 0.057"
## [1] "video.views: mean: 0.572; sd: 0.062"
## [1] "uploads: mean: 0.831; sd: 0.044"
## [1] "video_views_for_the_last_30_days: mean: 0.702; sd: 0.060"
## [1] "subscribers_for_last_30_days: mean: 0.776; sd: 0.044"

After conducting a 100-fold cross-validation on the numerical predictors, uploads demonstrates the strongest ability to differentiate between classes, with a mean AUC of 0.831 and a very small standard deviation of 0.044. All predictors show reasonably consistent performance across the 100 cross-validation folds.

fig1 <- ggplot(dCal) + geom_density(aes(x=predsubscribers, color=as.factor(status)))
fig2 <- ggplot(dCal) + geom_density(aes(x=predvideo.views, color=as.factor(status)))
fig3 <- ggplot(dCal) + geom_density(aes(x=preduploads, color=as.factor(status)))
fig4 <- ggplot(dCal) + geom_density(aes(x=predvideo_views_for_the_last_30_days, color=as.factor(status)))
fig5 <- ggplot(dCal) + geom_density(aes(x=predsubscribers_for_last_30_days, color=as.factor(status)))

grid.arrange(fig1, fig2, fig3, fig4, fig5, ncol=2)

The plots for predsubscribers and predvideo.views exhibit overlapping curves, which highlights the difficulty of differentiating between high- and low-earning channels based solely on these individual variables. This overlap demonstrates the importance of considering additional variables for a more accurate classification. In contrast, the preduploads plot distinctly underscores its efficacy in discerning between the two classes: its blue curve peaks around the 0.8 mark, suggesting that channels with high predicted upload counts often correlate with high earnings. In the predvideo_views_for_the_last_30_days plot, a peak in the blue curve around the 0.6 mark shows a clear association between channels with many predicted video views in the last 30 days and higher earnings. Furthermore, the predsubscribers_for_last_30_days plot illustrates a similar trend: channels with a notable increase in subscribers, evidenced by the peak around the 0.8 mark, tend to have high earnings.

calcAUC(dTrain[,"predsubscribers"], dTrain[,outcome])
calcAUC(dCal[,"predvideo.views"], dCal[,outcome])
calcAUC(dCal[,"preduploads"], dCal[,outcome])
calcAUC(dCal[,"predvideo_views_for_the_last_30_days"], dCal[,outcome])
calcAUC(dCal[,"predsubscribers_for_last_30_days"], dCal[,outcome])
## [1] 0.5839665
## [1] 0.5717054
## [1] 0.8147287
## [1] 0.7182171
## [1] 0.7887597
plot_roc(dTest$predsubscribers, dTest[,outcome]) #red
plot_roc(dTest$predvideo.views, dTest[,outcome], colour_id=3, overlaid=T) #green
plot_roc(dTest$preduploads, dTest[,outcome], colour_id=4, overlaid=T) #blue
plot_roc(dTest$predvideo_views_for_the_last_30_days, dTest[,outcome], colour_id=5, overlaid=T) #skyblue
plot_roc(dTest$predsubscribers_for_last_30_days, dTest[,outcome], colour_id=6, overlaid=T) #purple

df.roc <- dTest

Based on the insights derived from the ROC plot for the numerical variables, it is evident that each variable possesses a distinct discriminatory capability. The predictor predvideo.views, symbolized by the green curve, at times follows the dashed line, if not dips slightly below it. In contrast, both preduploads and predsubscribers_for_last_30_days are standout predictors: their curves lean prominently towards the upper-left of the plot, which signifies performance substantially better than random guessing. This pronounced deviation from the dashed line highlights the higher true positive rate of these two predictors, at any given false positive rate, compared to a random classifier.

Section 3 - Feature Selection

We employed two techniques for feature selection: the Boruta package and the Fisher score. As the Fisher score requires numeric features, we converted all features to numeric format and applied both techniques to ensure consistency.

df.rose <- df.rose %>%
  mutate(subscribers_for_last_30_days_isBAD  = as.numeric(subscribers_for_last_30_days_isBAD)) %>%
  mutate(video_views_for_the_last_30_days_isBAD  = as.numeric(video_views_for_the_last_30_days_isBAD)) %>%
  mutate(created_year  = as.numeric(created_year)) %>%
  mutate(Country  = as.numeric(Country)) %>%
  mutate(category  = as.numeric(category))

To ensure that the training, calibration, and test sets remain consistent with the conversion of features to numeric format, we regenerated the split from the updated data.

# a 90/10 split to form the training and test sets.
set.seed(729375)
df.rose$rgroup <- runif(dim(df.rose)[1])
dTrainAll <- subset(df.rose, rgroup<=0.9)
dTest <- subset(df.rose, rgroup>0.9)
outcome <- c('status')
pos <- "1"

# names of columns that are categorical type and numerical type
vars <- colnames(df.rose)[!(colnames(df.rose) %in% c("status", "rgroup"))]
catVars <- vars[sapply(dTrainAll[, vars], class) %in% c('factor', 'character')]
numericVars <- vars[sapply(dTrainAll[, vars], class) %in% c('numeric', 'integer')]

# split dTrainAll into a training set and a validation (or calibration) set
useForCal <- rbinom(n=dim(dTrainAll)[1], size=1, prob=0.1)>0
dCal <- subset(dTrainAll, useForCal)
dTrain <- subset(dTrainAll, !useForCal)

1. Boruta

library(Boruta)
selected_features <- df.rose[, c("status", vars), drop = FALSE]
boruta_result <- Boruta(status ~ ., data = selected_features, doTrace = 0)
roughFixMod <- TentativeRoughFix(boruta_result)
## Warning in TentativeRoughFix(boruta_result): There are no Tentative attributes!
## Returning original object.
boruta_signif <- getSelectedAttributes(roughFixMod)
attStats(boruta_result)
##                                          meanImp medianImp   minImp    maxImp
## subscribers                             6.349151  6.414727  4.45378  7.938653
## video.views                             7.748980  8.309504  5.33165  9.070138
## category                               10.404167 10.210467  9.53521 11.594118
## uploads                                44.667995 44.902370 41.13901 46.418478
## Country                                12.799833 12.539001 11.91663 13.988078
## video_views_for_the_last_30_days       34.649011 34.721749 31.81485 36.659204
## subscribers_for_last_30_days           51.156220 51.085080 48.76587 54.037692
## created_year                           16.262215 16.043087 15.24773 18.496473
## subscribers_for_last_30_days_isBAD     19.285774 19.190617 17.99441 20.706737
## video_views_for_the_last_30_days_isBAD 32.567822 32.938428 30.74729 33.847957
##                                        normHits  decision
## subscribers                                   1 Confirmed
## video.views                                   1 Confirmed
## category                                      1 Confirmed
## uploads                                       1 Confirmed
## Country                                       1 Confirmed
## video_views_for_the_last_30_days              1 Confirmed
## subscribers_for_last_30_days                  1 Confirmed
## created_year                                  1 Confirmed
## subscribers_for_last_30_days_isBAD            1 Confirmed
## video_views_for_the_last_30_days_isBAD        1 Confirmed

All the features listed in the table have been marked as “Confirmed”, which indicates that all of the predictors are significant for predicting the status outcome.

plot(boruta_result, cex.axis=.7, las = 2, xlab = "", main = "Variable Importance")

We decided to use the top seven predictors generated by the Boruta technique for the first combination of attributes, excluding the predictors category, video.views, and subscribers.

2. Fisher Score

# Fisher score
numericVars <- vars[sapply(df.rose[, vars], class) %in% c('numeric', 'integer')]
selected_numfeatures <- df.rose[, numericVars, drop = FALSE]
target_variable <- as.factor(df.rose$status)

# Function to calculate Fisher's Score for a single feature
fisher_score <- function(feature, target) {
  means <- tapply(feature, target, mean)
  variances <- tapply(feature, target, var)
  between_class_variance <- sum(table(target) * (means - mean(feature))^2) / (length(target) - 1)
  within_class_variance <- sum((table(target) - 1) * variances) / (length(target) - length(unique(target)))
  fisher_score <- between_class_variance / within_class_variance
  return(fisher_score)
}

# Calculate Fisher's Scores for all features
fisher_scores <- sapply(selected_numfeatures, function(feature) fisher_score(feature, target_variable))
kable(fisher_scores)
Feature                                      Fisher Score
subscribers                                     0.0003266
video.views                                     0.0076178
category                                        0.0012865
uploads                                         0.0219873
Country                                         0.0311487
video_views_for_the_last_30_days                0.0019176
subscribers_for_last_30_days                    0.0441089
created_year                                    0.0001115
subscribers_for_last_30_days_isBAD              0.0996885
video_views_for_the_last_30_days_isBAD          0.1793985

The Fisher scores measure how well each numerical feature discriminates between high earners and low earners. The predictors video_views_for_the_last_30_days_isBAD, subscribers_for_last_30_days_isBAD, and subscribers_for_last_30_days show the highest discriminatory power.
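As a sanity check on the fisher_score function defined above, a small synthetic example (hypothetical data, not part of the dataset) shows that a feature whose class means differ sharply scores much higher than pure noise:

set.seed(1)
y  <- factor(rep(c(0, 1), each = 50))
x1 <- rnorm(100, mean = ifelse(y == 1, 3, 0)) # class means differ
x2 <- rnorm(100)                              # pure noise
fisher_score(x1, y) # large: strong discriminatory power
fisher_score(x2, y) # near zero: no discriminatory power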

fisher_data <- data.frame(Feature = colnames(selected_numfeatures), Score = fisher_scores)
ggplot(fisher_data, aes(x = Feature, y = Score)) +
  geom_bar(stat = "identity", fill = "gray") +
  labs(title = "Fisher's Scores", x = "Features", y = "Scores") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

We have opted to retain the same features as in the initial attribute combination, excluding created_year and substituting it with video.views.

Section 4 - Decision Tree

Based on the results of the feature selection process, we have two sets of attributes as follows. We will use decision trees on these two sets.

boruta.vars <- colnames(df.rose)[(colnames(df.rose) %in% c("video_views_for_the_last_30_days_isBAD", "video_views_for_the_last_30_days", "subscribers_for_last_30_days", "subscribers_for_last_30_days_isBAD", "uploads", "Country", "created_year"))]
# boruta.vars <- colnames(df.rose)[(colnames(df.rose) %in% c("subscribers_for_last_30_days", "uploads"))]
boruta.formula <- paste('as.factor(status) ~ ',
            paste(c(boruta.vars), collapse=' + '),
            sep='')
dt.boruta <- rpart(formula = boruta.formula, data = dTrain)

fisher.vars <- colnames(df.rose)[(colnames(df.rose) %in% c("video_views_for_the_last_30_days_isBAD", "subscribers_for_last_30_days_isBAD", "video_views_for_the_last_30_days", "subscribers_for_last_30_days", "uploads", "Country", "video.views"))]
fisher.formula <- paste('as.factor(status) ~ ',
            paste(c(fisher.vars), collapse=' + '),
            sep='')
dt.fisher <- rpart(formula = fisher.formula, data = dTrain)
rpart.plot(dt.boruta)
rpart.plot(dt.fisher)

We can observe that the decision tree plots are identical for these two sets of attributes. However, the variable importance differs between the two sets. In the Boruta feature set, the important variables are uploads, subscribers_for_last_30_days, and video_views_for_the_last_30_days, with importance values of 53, 43, and 4, respectively. On the other hand, in the Fisher feature set, the important variables are uploads, subscribers_for_last_30_days, video_views_for_the_last_30_days, and video.views, with importance values of 52, 42, 4, and 2, respectively.
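The importance values quoted above can be reproduced directly from the fitted trees; rpart stores them in the variable.importance component, which we can rescale to sum to 100 (assumes dt.boruta and dt.fisher from the chunk above):

round(100 * dt.boruta$variable.importance / sum(dt.boruta$variable.importance))
round(100 * dt.fisher$variable.importance / sum(dt.fisher$variable.importance))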

predicted_train <- predict(dt.boruta, newdata=dTrain[boruta.vars], type = "class")
predicted_test <- predict(dt.boruta, newdata=dTest[boruta.vars], type = "class")
predicted_cal <- predict(dt.boruta, newdata=dCal[boruta.vars], type = "class")

calculate_accuracy <- function(T1) {
  diagonal_sum <- sum(diag(T1))
  total_sum <- sum(T1)
  accuracy <- diagonal_sum/total_sum
  return(accuracy)
}

table(dTrain$status, predicted_train)
##    predicted_train
##       0   1
##   0 358  35
##   1  65 358
table(dTest$status, predicted_test)
##    predicted_test
##      0  1
##   0 49  3
##   1  4 41
table(dCal$status, predicted_cal)
##    predicted_cal
##      0  1
##   0 38  5
##   1  6 24
calculate_accuracy(table(dTrain$status, predicted_train))
## [1] 0.877451
calculate_accuracy(table(dTest$status, predicted_test))
## [1] 0.9278351
calculate_accuracy(table(dCal$status, predicted_cal))
## [1] 0.8493151
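Accuracy alone can hide class imbalance; from the test confusion matrix printed above we can also read off precision and recall for the positive class (a quick check using the printed counts):

# From the test confusion matrix: TN = 49, FP = 3, FN = 4, TP = 41
tp <- 41; fp <- 3; fn <- 4; tn <- 49
tp / (tp + fp) # precision, approximately 0.932
tp / (tp + fn) # recall, approximately 0.911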
predicted_train <- predict(dt.fisher, newdata=dTrain[fisher.vars], type = "class")
predicted_test <- predict(dt.fisher, newdata=dTest[fisher.vars], type = "class")
predicted_cal <- predict(dt.fisher, newdata=dCal[fisher.vars], type = "class")

table(dTrain$status, predicted_train)
##    predicted_train
##       0   1
##   0 358  35
##   1  65 358
table(dTest$status, predicted_test)
##    predicted_test
##      0  1
##   0 49  3
##   1  4 41
table(dCal$status, predicted_cal)
##    predicted_cal
##      0  1
##   0 38  5
##   1  6 24
calculate_accuracy(table(dTrain$status, predicted_train))
## [1] 0.877451
calculate_accuracy(table(dTest$status, predicted_test))
## [1] 0.9278351
calculate_accuracy(table(dCal$status, predicted_cal))
## [1] 0.8493151

As anticipated, the confusion matrices and accuracies are identical for the two attribute sets. The model achieves an accuracy of approximately 93% on the test set (88% on training and 85% on calibration), which is notably high.

Section 5 - Logistic Regression

1. Logistic Regression

# Boruta-selected model 
response <- outcome
features <- boruta.vars
formula.boruta <- as.formula(paste(response, paste(features, collapse=" + "), sep=" ~ "))
model.boruta <- glm(formula.boruta, data=dTrain, family=binomial(link="logit"))
summary(model.boruta)
## 
## Call:
## glm(formula = formula.boruta, family = binomial(link = "logit"), 
##     data = dTrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1640  -1.0309   0.1951   0.9132   1.9855  
## 
## Coefficients:
##                                          Estimate Std. Error z value Pr(>|z|)
## (Intercept)                             2.112e+01  6.087e+02   0.035   0.9723
## uploads                                 9.071e-06  3.929e-06   2.309   0.0210
## Country                                -2.486e-02  5.611e-03  -4.430 9.42e-06
## video_views_for_the_last_30_days       -3.051e-10  1.393e-10  -2.191   0.0285
## subscribers_for_last_30_days            1.943e-06  3.599e-07   5.397 6.76e-08
## created_year                           -2.614e-02  1.863e-02  -1.403   0.1606
## subscribers_for_last_30_days_isBAD     -9.422e-01  1.701e-01  -5.538 3.06e-08
## video_views_for_the_last_30_days_isBAD -1.864e+01  6.087e+02  -0.031   0.9756
##                                           
## (Intercept)                               
## uploads                                *  
## Country                                ***
## video_views_for_the_last_30_days       *  
## subscribers_for_last_30_days           ***
## created_year                              
## subscribers_for_last_30_days_isBAD     ***
## video_views_for_the_last_30_days_isBAD    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1130.11  on 815  degrees of freedom
## Residual deviance:  846.03  on 808  degrees of freedom
## AIC: 862.03
## 
## Number of Fisher Scoring iterations: 17
dTrain$logpred1 <- predict(model.boruta, newdata=dTrain, type="response")
dTest$logpred1 <- predict(model.boruta, newdata=dTest, type="response")
# Fisher's Score-selected model 
response <- outcome
features <- fisher.vars
formula.fisher <- as.formula(paste(response, paste(features, collapse=" + "), sep=" ~ "))
model.fisher <- glm(formula.fisher, data=dTrain, family="binomial")
summary(model.fisher)
## 
## Call:
## glm(formula = formula.fisher, family = "binomial", data = dTrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1687  -1.0238   0.1864   0.9101   1.9784  
## 
## Coefficients:
##                                          Estimate Std. Error z value Pr(>|z|)
## (Intercept)                             2.074e+01  6.118e+02   0.034   0.9730
## video.views                             4.290e-12  7.160e-12   0.599   0.5490
## uploads                                 9.299e-06  3.924e-06   2.370   0.0178
## Country                                -2.473e-02  5.613e-03  -4.406 1.05e-05
## video_views_for_the_last_30_days       -3.027e-10  1.392e-10  -2.175   0.0296
## subscribers_for_last_30_days            1.887e-06  3.616e-07   5.220 1.79e-07
## subscribers_for_last_30_days_isBAD     -9.241e-01  1.701e-01  -5.432 5.57e-08
## video_views_for_the_last_30_days_isBAD -1.858e+01  6.118e+02  -0.030   0.9758
##                                           
## (Intercept)                               
## video.views                               
## uploads                                *  
## Country                                ***
## video_views_for_the_last_30_days       *  
## subscribers_for_last_30_days           ***
## subscribers_for_last_30_days_isBAD     ***
## video_views_for_the_last_30_days_isBAD    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1130.11  on 815  degrees of freedom
## Residual deviance:  847.64  on 808  degrees of freedom
## AIC: 863.64
## 
## Number of Fisher Scoring iterations: 17
dTrain$logpred2 <- predict(model.fisher, newdata=dTrain, type="response")
dTest$logpred2 <- predict(model.fisher, newdata=dTest, type="response")

Based on the summaries of both logistic regression models, they share similar coefficients and significance levels for the common predictors. Two predictors, uploads and subscribers_for_last_30_days, show a positive association with the positive class, while subscribers_for_last_30_days_isBAD is strongly associated with the negative class, signifying its strong predictive power for that class. The predictor Country is also statistically significant; since it is encoded as an integer, its negative coefficient mainly indicates that the likelihood of belonging to the positive class varies across countries. Furthermore, the Boruta-selected and Fisher's Score-selected models exhibit minimal differences in residual deviance and AIC. This similarity suggests that both models have a comparable level of complexity and perform similarly well in predicting the outcome.
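Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret (an optional check, assuming model.boruta fitted above):

# Odds ratio per one-unit increase in each predictor
round(exp(coef(model.boruta)), 4)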

2. Logistic Regression Performance Evaluation

In this section, we define and describe several functions used for performance evaluation of logistic regression models. These functions help us assess the accuracy, precision, recall, F1 score, and AUC of our models to evaluate the effectiveness of logistic regression models in predicting binary outcomes.

# Function to compute accuracy, precision, recall, and F1 score
performanceMeasures <- function(pred, truth, name = "model") {
   ctable <- table(truth = truth, pred = (pred > 0.5))
   accuracy <- sum(diag(ctable)) / sum(ctable)
   precision <- ctable[2, 2] / sum(ctable[, 2])
   recall <- ctable[2, 2] / sum(ctable[2, ])
   f1 <- 2 * precision * recall / (precision + recall)
   data.frame(model = name, precision = precision,
              recall = recall,
              f1 = f1, accuracy = accuracy)
}
# Function to generate a well-formatted table comparing model performance on training and test datasets
pretty_perf_table <- function(model,training,test) {
  library(pander)
  # setting up Pander Options
  panderOptions("plain.ascii", TRUE)
  panderOptions("keep.trailing.zeros", TRUE)
  panderOptions("table.style", "simple")
  perf_justify <- "lrrrr"

  # comparing performance on training vs. test
  pred_train <- predict(model, newdata=dTrain, type="response") # probabilities, so 0.5 is a probability threshold
  truth_train <- dTrain[, "status"]
  pred_test <- predict(model, newdata=dTest, type="response")
  truth_test <- dTest[, "status"]

  trainperf_tree <- performanceMeasures(
      pred_train, truth_train, "logistic, training")

  testperf_tree <- performanceMeasures(
      pred_test, truth_test, "logistic, test")

  perftable <- rbind(trainperf_tree, testperf_tree)
  pandoc.table(perftable, justify = perf_justify)
}
# Function to calculate and display a confusion matrix to visualize the model's classification results
confusion_matrix <- function(ytrue, ypred) {
  ctable <- table(truth = ytrue, prediction = (ypred > 0.5))
  print(ctable)
}
# Function to compute AUC (uses prediction() and performance() from the ROCR package)
library(ROCR)
calculate_auc <- function(model, data, outcome_colname) {
  predictions <- predict(model, newdata=data, type="response")
  pred <- prediction(predictions, data[[outcome_colname]])
  perf <- performance(pred, "auc")
  auc <- as.numeric(perf@y.values[[1]])
  return(auc)
}
pretty_perf_table(model.boruta, dTrain, dTest)
## 
## Attaching package: 'pander'
## The following object is masked from 'package:shiny':
## 
##     p
## 
## 
## model                  precision   recall       f1   accuracy
## -------------------- ----------- -------- -------- ----------
## logistic, training        0.7895   0.5674   0.6602     0.6973
## logistic, test            0.7568   0.6222   0.6829     0.7320
pretty_perf_table(model.fisher, dTrain, dTest)
## 
## 
## model                  precision   recall       f1   accuracy
## -------------------- ----------- -------- -------- ----------
## logistic, training        0.7954   0.5697   0.6639     0.7010
## logistic, test            0.7500   0.6000   0.6667     0.7216

While the Boruta-selected and Fisher's Score-selected models show only minimal disparities in the evaluation metrics, both perform reasonably well, each reaching an accuracy of roughly 70% on the test set and demonstrating an ability to classify unseen data. The model using the Boruta-selected features has a slightly higher test accuracy, at about 73%. The precision scores of both models are notably high, indicating few false positives when predicting the positive class. Recall is moderate, with values of 0.62 and 0.60 on the test sets, meaning each model correctly identifies a significant portion of the positive class. The F1 scores of roughly 0.66 to 0.68 reflect the harmonic mean of precision and recall.
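The F1 values in the tables can be verified directly as the harmonic mean of the printed precision and recall; for the Boruta model's test row:

precision <- 0.7568; recall <- 0.6222
2 * precision * recall / (precision + recall) # approximately 0.6829, matching the table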

confusion_matrix(dTest$status, dTest$logpred1) # the function thresholds at 0.5 internally
##      prediction
## truth FALSE TRUE
##     0    32   20
##     1    13   32
confusion_matrix(dTest$status, dTest$logpred2)
##      prediction
## truth FALSE TRUE
##     0    34   18
##     1    13   32

The confusion matrices above show that both models have a similar number of true positives, true negatives, false positives, and false negatives. This implies the models correctly classify a similar number of positive and negative instances, indicating a balanced classification capability. The number of false positives is relatively low, which aligns with the high precision scores mentioned earlier.

auc_train <- calculate_auc(model.boruta, dTrain, "status")
auc_test <- calculate_auc(model.boruta, dTest, "status")
print(paste("AUC for training data: ", round(auc_train, 4), "; AUC for test data: ", round(auc_test, 4)))
## [1] "AUC for training data:  0.7987 ; AUC for test data:  0.7654"
auc_train <- calculate_auc(model.fisher, dTrain, "status")
auc_test <- calculate_auc(model.fisher, dTest, "status")
print(paste("AUC for training data: ", round(auc_train, 4), "; AUC for test data: ", round(auc_test, 4)))
## [1] "AUC for training data:  0.7975 ; AUC for test data:  0.762"

The AUC score of the test data for both models is slightly lower than the AUC score of the training data, which is an indication that the model may be slightly overfitting the training data. However, the difference is not substantial, indicating that the model’s performance remains reasonably consistent when applied to unseen data. Notably, both the training and test AUC scores are well above the threshold of 0.5, indicating that the model has discriminatory power and can effectively distinguish between the positive and negative classes.
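The ROC curves behind these AUC scores can be drawn with the same ROCR primitives used in calculate_auc (a sketch, assuming dTest$logpred1 from the prediction step above):

library(ROCR)
pred <- prediction(dTest$logpred1, dTest$status)
plot(performance(pred, "tpr", "fpr"), main = "ROC - Boruta model")
abline(0, 1, lty = 2) # chance line, corresponding to AUC = 0.5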

Section 6 - XGBoost Model and LIME

In this section, we will explore the capabilities of XGBoost, a gradient boosting algorithm, and LIME, a tool for local model interpretation, to provide concise explanations of individual predictions.

library(xgboost)
# Input:
# - variable_matrix: matrix of input data
# - labelvec: numeric vector of class labels (1 is positive class)
# Returns:
# - xgboost model
xgb = function(variable_matrix, labelvec) {
  cv <- xgb.cv(data = variable_matrix, label = labelvec,
               params=list(
                 objective="binary:logistic"
               ),
               nfold=5,
               nrounds=100,
               print_every_n=10,
               metrics="logloss")

  evalframe <- as.data.frame(cv$evaluation_log)
  NROUNDS <- which.min(evalframe$test_logloss_mean)

  model <- xgboost(data=variable_matrix, label=labelvec,
                   params=list(
                     objective="binary:logistic"
                   ),
                   nrounds=NROUNDS,
                   verbose=FALSE)

  model
}

The xgb function trains an XGBoost model for binary classification, using five-fold cross-validation to tune a single hyperparameter: the number of boosting rounds, chosen where the cross-validated test log loss is minimized. Up to 100 rounds of boosting are evaluated; during each round, the model corrects its previous errors and updates its predictions based on the knowledge accumulated from earlier rounds. As a result, the training log loss decreases progressively with each boosting round, while the test log loss eventually starts to rise again as the model overfits.
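To inspect the cross-validated curve that determines NROUNDS, a small variant of the function above (hypothetical, with xgb_with_log as an illustrative name) could return the evaluation log alongside the chosen round:

# Sketch: same xgb.cv call as above, but keep the evaluation log for plotting
xgb_with_log <- function(variable_matrix, labelvec) {
  cv <- xgb.cv(data = variable_matrix, label = labelvec,
               params = list(objective = "binary:logistic"),
               nfold = 5, nrounds = 100, verbose = FALSE,
               metrics = "logloss")
  evalframe <- as.data.frame(cv$evaluation_log)
  list(best_nrounds = which.min(evalframe$test_logloss_mean),
       log = evalframe)
}
# plot(log$iter, log$test_logloss_mean, type = "l") would show the U-shaped curve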

dTrain$status <- as.numeric(as.character(dTrain$status)) # xgboost requires numeric labels 
input <- as.matrix(dTrain[boruta.vars])
model <- xgb(input, dTrain$status)
## [1]  train-logloss:0.530653+0.002449 test-logloss:0.546941+0.006926 
## [11] train-logloss:0.176272+0.003009 test-logloss:0.289694+0.022813 
## [21] train-logloss:0.106269+0.002478 test-logloss:0.280672+0.028782 
## [31] train-logloss:0.069780+0.001416 test-logloss:0.291730+0.029878 
## [41] train-logloss:0.049826+0.000851 test-logloss:0.298976+0.031276 
## [51] train-logloss:0.038868+0.001246 test-logloss:0.306169+0.035693 
## [61] train-logloss:0.031483+0.000810 test-logloss:0.316650+0.039309 
## [71] train-logloss:0.026763+0.000815 test-logloss:0.323084+0.042292 
## [81] train-logloss:0.023415+0.000915 test-logloss:0.328537+0.046447 
## [91] train-logloss:0.021090+0.000909 test-logloss:0.335342+0.046107 
## [100]    train-logloss:0.019337+0.000734 test-logloss:0.339142+0.047221

The cross-validated test log loss bottoms out at approximately 0.28 around round 21; since NROUNDS is chosen at that minimum, the final model is trained with roughly that many rounds. This indicates that the model learns effectively from the data and makes accurate probabilistic predictions.
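With the fitted booster in hand, scoring held-out data is straightforward (a sketch, assuming model, dTest, and boruta.vars from above):

test_matrix <- as.matrix(dTest[boruta.vars])
probs <- predict(model, test_matrix)      # predicted P(status = 1)
table(truth = dTest$status, pred = probs > 0.5)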

Lime Interpretation

The following code uses the lime function to build an explainer from the feature columns of the training dataset and the model, binning the continuous variables into 10 bins when generating explanations.

library(lime)
## 
## Attaching package: 'lime'
## The following object is masked from 'package:dplyr':
## 
##     explain
explainer <- lime(dTrain[boruta.vars], model = model, 
                  bin_continuous = TRUE, n_bins = 10)
## Warning: subscribers_for_last_30_days_isBAD does not contain enough variance to
## use quantile binning. Using standard binning instead.
## Warning: video_views_for_the_last_30_days_isBAD does not contain enough
## variance to use quantile binning. Using standard binning instead.
cases <- c(3,11,24,49)
(example <- dTest[cases,boruta.vars])
##       uploads Country video_views_for_the_last_30_days
## 30  22043.552      47                        617638209
## 110 25047.792      17                        180374299
## 252  4707.735      47                         19409274
## 546  2301.165      47                        175430062
##     subscribers_for_last_30_days created_year
## 30                     128950.13            5
## 110                    152343.48            8
## 252                    -97591.25            4
## 546                    295223.40           15
##     subscribers_for_last_30_days_isBAD video_views_for_the_last_30_days_isBAD
## 30                                   1                                      1
## 110                                  1                                      1
## 252                                  1                                      1
## 546                                  2                                      1
explanation <- lime::explain(example, explainer, n_labels = 1, n_features = 7)
ggsave("plot1.png", plot = plot_features(explanation), width = 12, height = 6)

From these random cases, the LIME plot illustrates how different features influence the model's decision, highlighting that intuitively positive metrics such as high views or subscribers do not always support a high-earner classification. Case 30, for example, is classified as a high earner with 85% probability, with uploads and video_views_for_the_last_30_days_isBAD supporting that classification, while its contradicting features (video views in the last 30 days, new subscribers, country, in this case the United States, and the channel's age) weigh against the high-earner status. Overall, we can see from the plot that uploads is the determining feature for the classification of these four instances; the remaining features are less significant.

input <- as.matrix(dTrain[fisher.vars])
model <- xgb(input, dTrain$status)
## [1]  train-logloss:0.532536+0.002052 test-logloss:0.552204+0.006971 
## [11] train-logloss:0.174255+0.007759 test-logloss:0.303433+0.019853 
## [21] train-logloss:0.105810+0.008274 test-logloss:0.288477+0.038694 
## [31] train-logloss:0.068876+0.004367 test-logloss:0.291952+0.047568 
## [41] train-logloss:0.049067+0.002822 test-logloss:0.305477+0.059234 
## [51] train-logloss:0.037897+0.002131 test-logloss:0.315562+0.062005 
## [61] train-logloss:0.030824+0.001700 test-logloss:0.322023+0.066567 
## [71] train-logloss:0.025966+0.001228 test-logloss:0.326077+0.069598 
## [81] train-logloss:0.022798+0.001031 test-logloss:0.334768+0.071992 
## [91] train-logloss:0.020458+0.000954 test-logloss:0.341398+0.073774 
## [100]    train-logloss:0.018802+0.000758 test-logloss:0.345324+0.075844

The cross-validated test log loss bottoms out at approximately 0.29, slightly higher than the minimum reached with the features generated by Boruta. The difference is marginal, however, and the log loss remains relatively low, indicating that this model also makes reasonably accurate probabilistic predictions.

explainer <- lime(dTrain[fisher.vars], model = model, 
                  bin_continuous = TRUE, n_bins = 10)
## Warning: subscribers_for_last_30_days_isBAD does not contain enough variance to
## use quantile binning. Using standard binning instead.
## Warning: video_views_for_the_last_30_days_isBAD does not contain enough
## variance to use quantile binning. Using standard binning instead.
cases <- c(3,11,24,49)
(example <- dTest[cases,fisher.vars])
##     video.views   uploads Country video_views_for_the_last_30_days
## 30   2638870777 22043.552      47                        617638209
## 110  4126755625 25047.792      17                        180374299
## 252 11488891033  4707.735      47                         19409274
## 546 48990569078  2301.165      47                        175430062
##     subscribers_for_last_30_days subscribers_for_last_30_days_isBAD
## 30                     128950.13                                  1
## 110                    152343.48                                  1
## 252                    -97591.25                                  1
## 546                    295223.40                                  2
##     video_views_for_the_last_30_days_isBAD
## 30                                       1
## 110                                      1
## 252                                      1
## 546                                      1
explanation <- lime::explain(example, explainer, n_labels = 1, n_features = 7)
ggsave("plot2.png", plot = plot_features(explanation), width = 12, height = 6)

Similar to the LIME plot generated from feature combination 1 (identified by Boruta), the feature set derived from Fisher's Score produces comparable explanations.

Conclusion

Classification is a supervised learning technique used to categorize data points into predefined labels based on their features. Its purpose is to predict the predefined target label of a data point based on its characteristics, utilizing labeled training data to understand the relationships between different categories.

Our exploration into the classification of the given dataset provided significant insights. Starting from analyzing single variable models, it was evident that the features uploads and subscribers for the last 30 days stood out in performance as evidenced by their AUC scores and the ROC plot analysis. This observation was reinforced when we delved into multivariable classification models.

When we integrated the feature combinations generated by Boruta and Fisher’s score into different models, the results were highly similar. Specifically, in the logistic regression models, the 1% variance in accuracy and similar AUC scores between the two feature combinations suggests the consistent predictive power of the shared variables in both combinations. Meanwhile, the XGBoost model presented marginally better performance in the Boruta feature set over the Fisher’s set, but the differences are not large enough to definitively favor one over the other. The minimal discrepancy between the results might be due to the fact that the two combinations only differed by a single variable.

However, it is the Decision Tree model that offered a notable revelation. While both feature combinations produced almost identical trees, the standout finding was that just two features, uploads and subscribers for the last 30 days, were enough to yield an accuracy rate of 92%. This suggests that the other predictors might not be as influential.

Given the above findings, it becomes evident that the Decision Tree model, specifically when reduced to the two aforementioned significant predictors, emerges as the most efficient choice for classification in this scenario. The ability of the Decision Tree to achieve these results with fewer predictors highlights its efficiency and potential for scalability in future applications.

Part 2 - Clustering

We aim to cluster YouTubers based on their content creation habits, user engagement, and viewership statistics. We do not need a target variable for clustering. Our dataset consists of both numeric and categorical features, which we will convert into numeric values. Since these features have different scales, we will standardize them before using Euclidean Distance as the similarity measure for our clustering analysis.
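Standardization matters here because Euclidean distance is dominated by whichever feature has the largest scale. A toy illustration (hypothetical values, not from the dataset):

x <- data.frame(views   = c(1e9, 2e9, 1.5e9),
                uploads = c(10, 5000, 200))
dist(x)         # distances dominated entirely by the views column
dist(scale(x))  # after standardization, both features contribute comparably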

Section 1 - Data Preparation

As part of the pre-processing step, we will clean and prepare the YouTuber names, as we intend to use them for visualizing the results of the clustering analysis.

df_gdp[grepl("^ý.*ý$", df_gdp$Youtuber), ]

There are 18 invalid Youtuber names that consist only of "ý" symbols. To ensure the integrity of our data and a meaningful analysis, we will exclude these invalid names from the dataset. Some of the remaining Youtuber names still contain "ý" and "�" symbols; to improve readability, we will strip these symbols out.

df_gdp <- df_gdp %>%
  filter(!grepl("^ý.*ý$", Youtuber)) %>%
  mutate(Youtuber = gsub("ý", "", Youtuber)) %>%
  mutate(Youtuber = gsub("�", "", Youtuber))
# Convert the Country column to a numeric column
country_mapping <- c(
  "Afghanistan" = 1, "Andorra" = 2, "Argentina" = 3, "Australia" = 4,
  "Bangladesh" = 5, "Barbados" = 6, "Brazil" = 7, "Canada" = 8,
  "Chile" = 9, "China" = 10, "Colombia" = 11, "Cuba" = 12,
  "Ecuador" = 13, "Egypt" = 14, "El Salvador" = 15, "Finland" = 16,
  "France" = 17, "Germany" = 18, "India" = 19, "Indonesia" = 20,
  "Iraq" = 21, "Italy" = 22, "Japan" = 23, "Jordan" = 24,
  "Kuwait" = 25, "Latvia" = 26, "Malaysia" = 27, "Mexico" = 28,
  "Morocco" = 29, "Netherlands" = 30, "Pakistan" = 31, "Peru" = 32,
  "Philippines" = 33, "Russia" = 34, "Samoa" = 35, "Saudi Arabia" = 36,
  "Singapore" = 37, "South Korea" = 38, "Spain" = 39, "Sweden" = 40,
  "Switzerland" = 41, "Thailand" = 42, "Turkey" = 43, "Ukraine" = 44,
  "United Arab Emirates" = 45, "United Kingdom" = 46, "United States" = 47,
  "Venezuela" = 48, "Vietnam" = 49)
df_gdp$Country <- country_mapping[df_gdp$Country]

# Convert the category column to a numeric column
category_mapping <- c(
  "Autos & Vehicles" = 1, "Comedy" = 2, "Education" = 3, "Entertainment" = 4,
  "Film & Animation" = 5, "Gaming" = 6, "Howto & Style" = 7, "Movies" = 8,
  "Music" = 9, "News & Politics" = 10, "Nonprofits & Activism" = 11, "People & Blogs" = 12,
  "Pets & Animals" = 13, "Science & Technology" = 14, "Shows" = 15, "Sports" = 16,
  "Trailers" = 17, "Travel & Events" = 18)
df_gdp$category <- category_mapping[df_gdp$category]

df_gdp <- df_gdp %>%
  mutate(created_year = as.numeric(created_year)) %>%
  mutate(subscribers_for_last_30_days_isBAD = as.numeric(subscribers_for_last_30_days_isBAD)) %>%
  mutate(video_views_for_the_last_30_days_isBAD = as.numeric(video_views_for_the_last_30_days_isBAD))
df.clust <- df_gdp[, -c(1:2, 8, 15:16)] %>% # Exclude irrelevant columns
  scale() %>% # Scale data
  as.data.frame()
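As an aside, the hand-written lookup vectors above could be replaced by letting factor() assign the integer codes automatically; since the mappings above list the levels alphabetically, this hypothetical alternative would produce the same codes (assuming all levels are present):

# Sketch: derive alphabetical integer codes without maintaining lookup vectors
df_gdp$Country  <- as.numeric(factor(df_gdp$Country))
df_gdp$category <- as.numeric(factor(df_gdp$category))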

Section 2 - Hierarchical Clustering

We will compare the behavior of complete, ward.D2, and single linkage methods to cluster YouTubers into clusters.

set.seed(123)
random_data <- df.clust[sample(nrow(df.clust), 100), ]
d.r <- dist(random_data, method = "euclidean")
pfit.c.r <- hclust(d.r, method = "complete")
pfit.w.r <- hclust(d.r, method = "ward.D2")
pfit.s.r <- hclust(d.r, method = "single")

plot(pfit.c.r, main="Cluster Dendrogram for Complete Linkage", cex = 0.6) 
rect.hclust(pfit.c.r, k=3)

plot(pfit.w.r, main="Cluster Dendrogram for Ward Linkage", cex = 0.6) 
rect.hclust(pfit.w.r, k=3)

plot(pfit.s.r, main="Cluster Dendrogram for Single Linkage", cex = 0.6) 
rect.hclust(pfit.s.r, k=3)

These dendrograms, drawn on a random sample of 100 observations, provide limited insight and are challenging to interpret.

d <- dist(df.clust, method = "euclidean")
pfit.c <- hclust(d, method = "complete")
pfit.w <- hclust(d, method = "ward.D2")
pfit.s <- hclust(d, method = "single")
k <- 1:5
groups.c <- cutree(pfit.c, k = k)
groups.w <- cutree(pfit.w, k = k)
groups.s <- cutree(pfit.s, k = k)

groups.c.fp <- as.data.frame(groups.c) %>%
  pivot_longer(cols = everything(), names_to = "cluster", values_to = "level")
conf.c <- xtabs(~level + cluster, groups.c.fp)

groups.w.fp <- as.data.frame(groups.w) %>%
  pivot_longer(cols = everything(), names_to = "cluster", values_to = "level")
conf.w <- xtabs(~level + cluster, groups.w.fp)

groups.s.fp <- as.data.frame(groups.s) %>%
  pivot_longer(cols = everything(), names_to = "cluster", values_to = "level")
conf.s <- xtabs(~level + cluster, groups.s.fp)

cat("Complete Linkage: \n")
## Complete Linkage:
conf.c
##      cluster
## level   1   2   3   4   5
##     1 968  11   3   3   3
##     2   0 957 957 955 955
##     3   0   0   8   8   1
##     4   0   0   0   2   7
##     5   0   0   0   0   2
cat("Ward.D2 Linkage: \n")
## Ward.D2 Linkage:
conf.w
##      cluster
## level   1   2   3   4   5
##     1 968  13  13  13  13
##     2   0 955 630 630 596
##     3   0   0 325 281  34
##     4   0   0   0  44 281
##     5   0   0   0   0  44
cat("Single Linkage: \n")
## Single Linkage:
conf.s
##      cluster
## level   1   2   3   4   5
##     1 968 966 965   1   1
##     2   0   2   1 964 963
##     3   0   0   2   1   1
##     4   0   0   0   2   2
##     5   0   0   0   0   1

Since the dendrograms are challenging to interpret, we tabulate the number of observations in each cluster for k = 1 through 5. The three linkage methods behave very differently: single linkage exhibits chaining, splitting off isolated observations while leaving one dominant cluster, whereas ward.D2 produces more balanced clusters because it merges groups so as to minimize the increase in within-cluster variance.
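The contrast between single linkage's chaining and ward.D2's balanced splits can be illustrated on a small synthetic example (this sketch uses generated blobs, not the YouTube data):

```r
# Synthetic sketch: two Gaussian blobs plus one extreme outlier.
set.seed(1)
blobs <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),  # blob 1 (50 points)
               matrix(rnorm(100, mean = 4), ncol = 2),  # blob 2 (50 points)
               c(20, 20))                               # one far-away outlier
d.demo <- dist(blobs)
table(cutree(hclust(d.demo, method = "single"),  k = 2))  # outlier isolated: 100 vs 1
table(cutree(hclust(d.demo, method = "ward.D2"), k = 2))  # roughly balanced blobs
```

Single linkage splits off the lone outlier because it merges on nearest neighbors, while ward.D2 prefers the split that separates the two dense blobs.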

k <- 5
groups.c <- cutree(pfit.c, k = k)
groups.w <- cutree(pfit.w, k = k)
groups.s <- cutree(pfit.s, k = k)
print_clusters <- function(df, groups, cols_to_print) { 
  Ngroups <- max(groups)
  for (i in 1:Ngroups) {
    print(paste("cluster", i))
    print(df[groups == i, cols_to_print]) 
  }
}
cols_to_print <- c("Youtuber", "subscribers", "highest_yearly_earnings", "Country") 
print_clusters(df_gdp, groups.c, cols_to_print)

For hierarchical clustering with complete linkage, the table shows a group of 3 observations at level 1 that remains unchanged across k = 3, 4, and 5. Using the print_clusters() function, we can identify these observations: T-Series, Cocomelon - Nursery Rhymes, and SET India. We might conjecture that this group is driven by a very high number of subscribers, high yearly earnings, and being from India. This interpretation is exploratory only, however, since cluster interpretation is highly subjective.

Visualize Clusters by PCA

princ <- prcomp(df.clust)
nComp <- 2
project2D <- as.data.frame(predict(princ, newdata=df.clust)[, 1:nComp])

hclust.c.project2D <- cbind(project2D, cluster=as.factor(groups.c), Youtuber=df_gdp$Youtuber)
hclust.w.project2D <- cbind(project2D, cluster=as.factor(groups.w), Youtuber=df_gdp$Youtuber)
hclust.s.project2D <- cbind(project2D, cluster=as.factor(groups.s), Youtuber=df_gdp$Youtuber)
# Finding convex hull
library(grDevices)
find_convex_hull <- function(proj2Ddf, groups) {
  do.call(rbind, 
          lapply(unique(groups),
                 FUN = function(c) {
                   f <- subset(proj2Ddf, cluster==c); 
                   f[chull(f),]
                 }
                ) 
          )
}
hclust.c.hull <- find_convex_hull(hclust.c.project2D, groups.c)
hclust.w.hull <- find_convex_hull(hclust.w.project2D, groups.w)
hclust.s.hull <- find_convex_hull(hclust.s.project2D, groups.s)
ggplot(hclust.c.project2D, aes(x=PC1, y=PC2)) +
  geom_point(aes(shape=cluster, color=cluster)) + 
  geom_text(aes(label=Youtuber, color=cluster), hjust=0, vjust=1, size=3) + 
  geom_polygon(data=hclust.c.hull, aes(group=cluster, fill=as.factor(cluster)), alpha=0.4, linetype=0) +
  ggtitle("Complete Linkage")

ggplot(hclust.w.project2D, aes(x=PC1, y=PC2)) +
  geom_point(aes(shape=cluster, color=cluster)) + 
  geom_text(aes(label=Youtuber, color=cluster), hjust=0, vjust=1, size=3) + 
  geom_polygon(data=hclust.w.hull, aes(group=cluster, fill=as.factor(cluster)), alpha=0.4, linetype=0) +
  ggtitle("Ward’s Linkage")

ggplot(hclust.s.project2D, aes(x=PC1, y=PC2)) +
  geom_point(aes(shape=cluster, color=cluster)) + 
  geom_text(aes(label=Youtuber, color=cluster), hjust=0, vjust=1, size=3) + 
  geom_polygon(data=hclust.s.hull, aes(group=cluster, fill=as.factor(cluster)), alpha=0.4, linetype=0) +
  ggtitle("Single Linkage")

Comparing the complete and Ward linkage methods, Ward linkage tends to form larger, more balanced clusters, whereas complete linkage produces a mix of very small and very large clusters.
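Since the hulls above are drawn in the first two principal components, it is worth checking how much variance the 2-D projection actually retains. A minimal self-contained sketch (synthetic data standing in for df.clust; in the report itself the same quantity is available from summary(princ)):

```r
# Fraction of total variance captured by the first two principal components.
set.seed(1)
demo <- matrix(rnorm(500), ncol = 5)   # stand-in for a scaled data frame
pr <- prcomp(demo)
var.explained <- cumsum(pr$sdev^2) / sum(pr$sdev^2)
var.explained[2]  # proportion of variance retained by PC1 + PC2
```

If this proportion is low, apparent overlap between hulls in the 2-D plot may be an artifact of the projection rather than of the clustering.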

Evaluate Clustering Performance

library(fpc)
kbest.p <- 12
cboot.hclust.c <- clusterboot(df.clust, clustermethod = hclustCBI, method = "complete", k = kbest.p)
## boot 1 ... boot 100 (clusterboot progress output truncated)
groups.cboot.c <- cboot.hclust.c$result$partition
cboot.hclust.w <- clusterboot(df.clust, clustermethod = hclustCBI, method = "ward.D2", k = kbest.p)
## boot 1 ... boot 100 (clusterboot progress output truncated)
groups.cboot.w <- cboot.hclust.w$result$partition
cat("Complete Linkage: \n")
## Complete Linkage:
(values <- 1 - cboot.hclust.c$bootbrd/100)
##  [1] 0.88 0.34 0.54 0.17 0.62 1.00 0.95 0.69 0.30 0.84 0.50 0.53
cat("So clusters", order(values)[12], "and", order(values)[11], "are highly stable \n")
## So clusters 6 and 7 are highly stable
cat("Ward's Linkage: \n")
## Ward's Linkage:
(values <- 1 - cboot.hclust.w$bootbrd/100)
##  [1] 0.66 0.40 0.65 0.59 0.86 0.76 0.29 0.72 1.00 1.00 0.94 0.83
cat("So clusters", order(values)[12], "and", order(values)[11], "are highly stable \n")
## So clusters 10 and 9 are highly stable
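For context, clusterboot resamples the data and measures how often each cluster reappears: $bootbrd counts the bootstrap runs in which a cluster dissolved, so 1 - bootbrd/100 is the fraction of runs each cluster survived. A minimal sketch on well-separated synthetic blobs (B = 20 to keep it fast; count = FALSE suppresses the "boot i" progress lines):

```r
library(fpc)

set.seed(1)
blobs <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
               matrix(rnorm(100, mean = 5), ncol = 2))
cb <- clusterboot(blobs, B = 20, clustermethod = hclustCBI,
                  method = "ward.D2", k = 2, count = FALSE)
cb$bootmean  # mean Jaccard similarity per cluster; above ~0.85 is highly stable
cb$bootbrd   # number of bootstrap runs in which each cluster dissolved
```

Well-separated blobs like these yield Jaccard means near 1, whereas the moderate values seen above for the YouTube data indicate only a few genuinely stable clusters.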
# Function to return the squared Euclidean distance of two given points x and y
sqr_euDist <- function(x, y) {
    sum((x - y)^2)
}

# Function to calculate WSS of a cluster
wss <- function(clustermat) {
    c0 <- colMeans(clustermat)
    sum(apply( clustermat, 1, FUN=function(row) {sqr_euDist(row, c0)} ))
}

# Function to calculate the total WSS
wss_total <- function(scaled_df, labels) {
    wss.sum <- 0
    k <- length(unique(labels))
    for (i in 1:k) 
        wss.sum <- wss.sum + wss(subset(scaled_df, labels == i))
    wss.sum
}

# Function to calculate total sum of squared (TSS)
tss <- function(scaled_df) {
   wss(scaled_df)
}
# Function to return the CH indices computed using complete linkage
CH_index.c <- function(scaled_df, kmax, method="kmeans") {
    if (!(method %in% c("kmeans", "hclust"))) 
        stop("method must be one of c('kmeans', 'hclust')")

    npts <- nrow(scaled_df)
    wss.value <- numeric(kmax) # create a vector of numeric type
    # wss.value[1] stores the WSS value for k=1 (when all the
    # data points form 1 large cluster).
    wss.value[1] <- wss(scaled_df)

    if (method == "kmeans") {
        # kmeans
        for (k in 2:kmax) {
            clustering <- kmeans(scaled_df, k, nstart=10, iter.max=100)
            wss.value[k] <- clustering$tot.withinss
        } 
    } else {
        # hclust
        d <- dist(scaled_df, method="euclidean")
        pfit <- hclust(d, method="complete")
        for (k in 2:kmax) {
            labels <- cutree(pfit, k=k)
            wss.value[k] <- wss_total(scaled_df, labels)
        }
    }
    bss.value <- tss(scaled_df) - wss.value   # this is a vector
    B <- bss.value / (0:(kmax-1))             # also a vector
    W <- wss.value / (npts - 1:kmax)          # also a vector

    data.frame(k = 1:kmax, CH_index = B/W, WSS = wss.value)
}
# Function to return the CH indices computed using ward.D2 linkage
CH_index.w <- function(scaled_df, kmax, method="kmeans") {
    if (!(method %in% c("kmeans", "hclust"))) 
        stop("method must be one of c('kmeans', 'hclust')")

    npts <- nrow(scaled_df)
    wss.value <- numeric(kmax) # create a vector of numeric type
    # wss.value[1] stores the WSS value for k=1 (when all the
    # data points form 1 large cluster).
    wss.value[1] <- wss(scaled_df)

    if (method == "kmeans") {
        # kmeans
        for (k in 2:kmax) {
            clustering <- kmeans(scaled_df, k, nstart=10, iter.max=100)
            wss.value[k] <- clustering$tot.withinss
        } 
    } else {
        # hclust
        d <- dist(scaled_df, method="euclidean")
        pfit <- hclust(d, method="ward.D2")
        for (k in 2:kmax) {
            labels <- cutree(pfit, k=k)
            wss.value[k] <- wss_total(scaled_df, labels)
        }
    }
    bss.value <- tss(scaled_df) - wss.value   # this is a vector
    B <- bss.value / (0:(kmax-1))             # also a vector
    W <- wss.value / (npts - 1:kmax)          # also a vector

    data.frame(k = 1:kmax, CH_index = B/W, WSS = wss.value)
}
# complete linkage
crit.df <- CH_index.c(df.clust, 12, method = "hclust")
fig1 <- ggplot(crit.df, aes(x=k, y=CH_index)) +
  geom_point() + geom_line(colour="red") + 
  scale_x_continuous(breaks=1:12, labels=1:12) +
  labs(y="CH index")
fig2 <- ggplot(crit.df, aes(x=k, y=WSS), color="blue") +
  geom_point() + geom_line(colour="blue") + 
  scale_x_continuous(breaks=1:12, labels=1:12)
grid.arrange(fig1, fig2, nrow=1)
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 row containing missing values (`geom_line()`).

# ward.D2 linkage
crit.df <- CH_index.w(df.clust, 12, method = "hclust")
fig1 <- ggplot(crit.df, aes(x=k, y=CH_index)) +
  geom_point() + geom_line(colour="red") + 
  scale_x_continuous(breaks=1:12, labels=1:12) +
  labs(y="CH index")
fig2 <- ggplot(crit.df, aes(x=k, y=WSS), color="blue") +
  geom_point() + geom_line(colour="blue") + 
  scale_x_continuous(breaks=1:12, labels=1:12)
grid.arrange(fig1, fig2, nrow=1)
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Removed 1 row containing missing values (`geom_line()`).

Using hierarchical clustering with Euclidean distance, the CH index is maximized at k = 2 under complete linkage and at k = 10 under ward.D2 linkage. (The CH index is undefined at k = 1, which is why ggplot warns about one removed point.) The WSS plot does not exhibit a clear elbow.
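As a sanity check on the CH computation above, the index for a given partition can be compared against fpc::calinhara, which implements the same ratio, CH = (BSS / (k - 1)) / (WSS / (n - k)). A sketch on synthetic data:

```r
library(fpc)

set.seed(1)
demo <- scale(rbind(matrix(rnorm(100, mean = 0), ncol = 2),
                    matrix(rnorm(100, mean = 4), ncol = 2)))
km <- kmeans(demo, centers = 2, nstart = 10)
n <- nrow(demo); k <- 2
ch.manual <- (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
ch.pkg <- calinhara(demo, km$cluster)
c(ch.manual, ch.pkg)  # the two values agree
```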

Section 3 - K-means Clustering

kmClustering.ch <- kmeansruns(df.clust, krange = 1:12, criterion = "ch")
kmClustering.ch$bestk
## [1] 2
kmClustering.asw <- kmeansruns(df.clust, krange = 1:12, criterion = "asw")
kmClustering.asw$bestk
## [1] 2
kmCritframe <- data.frame(k=1:12, ch=kmClustering.ch$crit, asw=kmClustering.asw$crit)
fig1 <- ggplot(kmCritframe, aes(x=k, y=ch)) +
  geom_point() + geom_line(colour="red") + 
  scale_x_continuous(breaks=1:12, labels=1:12) +
  labs(y="CH index")
fig2 <- ggplot(kmCritframe, aes(x=k, y=asw)) +  
  geom_point() + geom_line(colour="blue") +
  scale_x_continuous(breaks=1:12, labels=1:12) +
  labs(y="ASW")
grid.arrange(fig1, fig2, nrow=1)

Using k-means clustering, both the CH index and the ASW are maximized at k = 2 clusters.
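The "asw" criterion is the average silhouette width (ASW), which can also be computed directly with cluster::silhouette; a small synthetic sketch:

```r
library(cluster)

set.seed(1)
blobs <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
               matrix(rnorm(100, mean = 4), ncol = 2))
km <- kmeans(blobs, centers = 2, nstart = 10)
sil <- silhouette(km$cluster, dist(blobs))
mean(sil[, "sil_width"])  # ASW; values near 1 mean compact, well-separated clusters
```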

kvalues <- seq(2, 5)
for (k in kvalues) {
  set.seed(3064)
  groups <- kmeans(df.clust, k, nstart = 100, iter.max = 100)$cluster
  kmclust.project2D <- cbind(project2D, cluster = as.factor(groups), Youtuber = df_gdp$Youtuber)
  kmclust.hull <- find_convex_hull(kmclust.project2D, groups)
  assign(paste0("fig", k),
    ggplot(kmclust.project2D, aes(x = PC1, y = PC2)) +
    geom_point(aes(shape = cluster, color = cluster)) +
    geom_polygon(data = kmclust.hull, aes(group = cluster, fill = cluster),
                 alpha = 0.4, linetype = 0) + 
    labs(title = sprintf("k = %d", k)) +
    theme(legend.position ="none")
    )
}
grid.arrange(fig2, fig3, fig4, fig5, nrow = 2)

From these PCA plots, it’s evident that k-means clustering consistently forms a single cluster on the left-hand side. As we increase the value of k, it breaks down the cluster on the right-hand side into subclusters.

k_values <- 2:5
cluster_counts <- matrix(0, nrow = length(k_values), ncol = max(k_values)) # named to avoid masking base::table()
rownames(cluster_counts) <- k_values
colnames(cluster_counts) <- 1:max(k_values)

for (k_index in seq_along(k_values)) {
  set.seed(3064)
  k <- k_values[k_index]
  kmeans_result <- kmeans(df.clust, centers = k, nstart = 100, iter.max = 100)
  cluster_sizes <- table(kmeans_result$cluster)
  
  for (level in 1:k) {
    cluster_counts[k_index, level] <- cluster_sizes[level]
  }
}
cat("The number of observations in each cluster using k-means clustering \n")
## The number of observations in each cluster using k-means clustering
t(cluster_counts)
##     2   3   4   5
## 1  57 599 292  46
## 2 911  31 599  31
## 3   0 338  31 328
## 4   0   0  46 271
## 5   0   0   0 292

Insights from Clustering

We calculate the mean of each feature within each cluster to gain insights into the relationships between them.

set.seed(3064)
km <- kmeans(df.clust, 5, nstart = 100, iter.max = 100)
df_with_clusters <- df.clust %>% mutate(Cluster = km$cluster)
cluster_means <- df_with_clusters %>% group_by(Cluster) %>% summarise_all(mean)
ggplot(cluster_means) +
  geom_line(aes(x = Cluster, y = subscribers, linetype = "Subscribers", color = "Subscribers")) +
  geom_line(aes(x = Cluster, y = video.views, linetype = "Video Views", color = "Video Views")) +
  geom_line(aes(x = Cluster, y = uploads, linetype = "Uploads", color = "Uploads")) +
  geom_line(aes(x = Cluster, y = video_views_for_the_last_30_days, linetype = "Video Views Last 30 Days", color = "Video Views Last 30 Days")) +
  geom_line(aes(x = Cluster, y = highest_yearly_earnings, linetype = "Highest Yearly Earnings", color = "Highest Yearly Earnings")) +
  geom_line(aes(x = Cluster, y = subscribers_for_last_30_days, linetype = "Subscribers Last 30 Days", color = "Subscribers Last 30 Days")) +
  scale_x_continuous(breaks = 1:5, labels = 1:5) +
  labs(title = "Mean of each feature by cluster", x = "cluster", y = "mean") +
  scale_color_manual(values = c("Subscribers" = "blue", "Video Views" = "green", "Uploads" = "red", "Video Views Last 30 Days" = "purple", "Highest Yearly Earnings" = "orange", "Subscribers Last 30 Days" = "black")) +
  scale_linetype_manual(values = c("Subscribers" = "solid", "Video Views" = "dashed", "Uploads" = "dotted", "Video Views Last 30 Days" = "longdash", "Highest Yearly Earnings" = "twodash", "Subscribers Last 30 Days" = "solid"))

Cluster 2 exhibits much higher mean values on every feature than the other clusters, the means in cluster 3 are generally close to 0, and cluster 5 shows negative means. Given that the features were scaled by z-normalization, we can conclude that cluster 2 represents highly famous YouTubers with above-average income, subscribers, and video views; cluster 3 represents average YouTubers; and cluster 5 represents the less popular ones.
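This reading relies on the columns of df.clust being z-scores: scale() centers each column to mean 0 and rescales it to standard deviation 1, so a cluster mean of 0 is "average" and a positive mean is "above average". A quick check on synthetic data:

```r
# scale() produces z-scores column by column.
set.seed(1)
z <- scale(matrix(rexp(300), ncol = 3))  # any numeric matrix, here skewed data
round(colMeans(z), 10)                   # every column has mean 0
round(apply(z, 2, sd), 10)               # and standard deviation 1
```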

Conclusion

Clustering is an unsupervised learning technique used to group similar data points together based on their inherent similarities. Its purpose is to unveil natural patterns or groupings within the data without relying on a predefined target variable.

Hierarchical clustering proves to be particularly useful when the optimal number of clusters is unknown, allowing for an exploration of clustering behavior based on different distance metrics and linkage methods. The choice of distance and linkage depends on the characteristics of the data and the objectives of the analysis. In contrast, K-means clustering requires the user to specify the desired number of clusters beforehand.

Our clustering results show a significant gap between the optimal choices of k = 10 from hierarchical clustering (ward.D2) and k = 2 from k-means clustering. This gap reflects a trade-off between complexity and interpretability: higher values of k produce more finely grained clusters whose insights become harder to discern, while a smaller k, such as k = 2, is easier to interpret but may oversimplify the underlying patterns and lose valuable information.

References

Digital signature